The Advantages of a Biweight Metric in Clustering Microarray Data
Abstract
Distance metrics are often the backbone of clustering algorithms. Yet some distance metrics, such as those based on Pearson’s correlation, are sensitive to outliers, and microarray data tend to contain outlying data points. Intuitively, then, a Pearson-based metric may not be appropriate for clustering microarray data. Hardin et al. (2007) show that a metric based on Tukey’s biweight estimate of multivariate scale and location is more robust than a metric based on Pearson’s correlation. The goal of this paper is to find a way to evaluate whether a metric based on Tukey’s biweight creates more valid clusters (judged against known partitions in microarray data) than a metric based on Pearson’s correlation. We define an extreme outlier to be a data point that is sufficiently outlying with respect to the relationship between two genes. When an extreme outlier is removed, a metric based on Pearson’s correlation will often change its measurement of correlation dramatically, while a metric based on Tukey’s biweight will often show very little change in its estimate of correlation [11]. We consider a cluster “robust” if its genes are clustered the same way with and without the removal of extreme outliers from gene pairings. Our results suggest that a metric based on Tukey’s biweight creates more “robust” clusters than a metric based on Pearson’s correlation. As our clustering algorithm, we use Partitioning Around Medoids (PAM) [14].
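The outlier sensitivity described in the abstract can be illustrated with a small sketch. Note the assumptions: the code below uses the biweight midcorrelation, a simpler pairwise robust correlation built on Tukey's biweight weights, and not the iterative multivariate biweight estimate of scale and location that Hardin et al. use. The gene pair is made-up data, the tuning constant `c = 9` is a conventional choice rather than one taken from the paper, and all function names are hypothetical.

```python
import numpy as np


def pearson(x, y):
    """Ordinary Pearson correlation between two expression profiles."""
    return float(np.corrcoef(x, y)[0, 1])


def _biweight_terms(v, c=9.0):
    """Median-centered values multiplied by their Tukey biweight weights.

    Points farther than c * MAD from the median get weight zero.
    The constant c = 9 is an assumed, conventional tuning value.
    """
    med = np.median(v)
    mad = np.median(np.abs(v - med))
    if mad == 0:
        raise ValueError("MAD is zero; biweight weights are undefined")
    u = (v - med) / (c * mad)
    w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)
    return (v - med) * w


def bicor(x, y):
    """Biweight midcorrelation (a stand-in for the paper's biweight metric)."""
    a = _biweight_terms(np.asarray(x, dtype=float))
    b = _biweight_terms(np.asarray(y, dtype=float))
    return float(np.sum(a * b) / np.sqrt(np.sum(a**2) * np.sum(b**2)))


# Hypothetical gene pair: a nearly linear relationship across 20 samples.
x = np.arange(20.0)
y = 2.0 * x + np.sin(x)

# The same pair with one extreme outlier added (unremarkable in x, extreme in y).
x_out = np.append(x, 10.0)
y_out = np.append(y, 200.0)

# How much does each correlation move when the outlier is removed?
pearson_shift = abs(pearson(x, y) - pearson(x_out, y_out))
bicor_shift = abs(bicor(x, y) - bicor(x_out, y_out))

print(f"Pearson shift: {pearson_shift:.3f}")
print(f"Biweight midcorrelation shift: {bicor_shift:.3f}")
```

A correlation can be turned into a dissimilarity for PAM in several ways; `1 - correlation` is one common convention, but the abstract does not say which form the paper uses.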
Similar resources
Biweight Correlation as a Measure of Distance between Genes on a Microarray
The underlying goal of microarray experiments is to identify genetic patterns across different experimental conditions. Genes contained in a particular pathway or that respond similarly to experimental conditions should be coregulated and show similar patterns of expression on a microarray. Using any of a variety of clustering methods or gene network analyses, we can partition genes of interest...
Biweight Correlation as a Measure of Distance between Genes on a Microarray
Motivation: The underlying goal of microarray experiments is to identify genetic patterns across different experimental conditions. Genes that are contained in a particular pathway or that respond similarly to experimental conditions should be co-expressed and show similar patterns of expression on a microarray. Using any of a variety of clustering methods or gene network analyses we can partit...
Modification of the Fast Global K-means Using a Fuzzy Relation with Application in Microarray Data Analysis
Recognizing genes with distinctive expression levels can help in prevention, diagnosis and treatment of the diseases at the genomic level. In this paper, fast Global k-means (fast GKM) is developed for clustering the gene expression datasets. Fast GKM is a significant improvement of the k-means clustering method. It is an incremental clustering method which starts with one cluster. Iteratively ...
Application of Clustering Methods in DNA Microarrays
Background: DNA microarray technology has paved the way for investigators to measure the expression of thousands of genes in a short time. Analysis of this large amount of raw data includes normalization, clustering and classification. The present study surveys the application of clustering techniques in DNA microarray analysis. Materials and methods: We analyzed data from the Van’t Veer et al. study dealing with BRCA1...
Composite Kernel Optimization in Semi-Supervised Metric
Machine-learning solutions to classification, clustering and matching problems critically depend on the adopted metric, which in the past was selected heuristically. In the last decade, it has been demonstrated that an appropriate metric can be learnt from data, resulting in superior performance as compared with traditional metrics. This has recently stimulated a considerable interest in the to...
Publication date: 2008